最近的几项研究指出,现有的视觉问题回答(VQA)模型严重遭受了先前的问题的困扰,这是指捕获问题类型和答案之间的表面统计相关性,而忽略了图像内容。通过创建精致的模型或引入额外的视觉注释,已经致力于加强图像依赖性。但是,这些方法无法充分探索视觉提示如何显式影响学习的答案表示,这对于减轻语言的依赖至关重要。此外,他们通常强调对学习的答案表示形式的班级歧视,这忽略了更精细的实例级别模式,并要求进一步优化。在本文中,我们从视觉扰动校准的角度提出了一种新颖的协作学习方案,该方案可以更好地研究细粒度的视觉效果,并通过学习实例级别的特征来减轻语言的先验问题。具体而言,我们设计了一个视觉控制器来构建具有不同扰动范围的两种策划图像,基于该图像的协作学习内置不变性和实体歧视的协作学习由两个精心设计的歧视者实现。此外,我们在潜在空间上实施信息瓶颈调制器,以进一步减轻偏见和表示校准。我们将视觉扰动感知框架强加于三个正统基准,并将实验结果对两个诊断性VQA-CP基准数据集进行了实验结果,显然表明了其有效性。此外,我们还证明了它在平衡的VQA基准上的鲁棒性是合理的。
translated by 谷歌翻译
视觉问题回答(VQA)本质上是从根本上组成的,许多问题仅通过将它们分解为模块化子问题就可以回答。最新提出的神经模块网络(NMN)采用此策略来问答案,而在现成的布局解析器或有关网络体系结构设计的其他专家政策中,而不是从数据中学习。这些策略导致对输入的语义复杂差异的适应性不令人满意,从而阻碍了模型的表示能力和概括性。为了解决这个问题,我们提出了一个语义吸引的模块化胶囊路由框架,称为Super,以更好地捕获特定实例的视觉 - 语义特征并完善预测的判别性表示。特别是,在超级网络的每一层中都定制了五个功能强大的专用模块以及动态路由器,并构造了紧凑的路由空间,使得可以充分利用各种可自定义的路由,并且可以明确校准视觉声称表示。我们相对证明,我们提出的超级方案在五个基准数据集以及参数效率优势上的有效性和概括能力合理。值得强调的是,这项工作不是在VQA中追求最先进的结果。取而代之的是,我们希望我们的模型有责任为VQA提供建筑学习和表示校准的新颖观点。
translated by 谷歌翻译
视觉问题的视觉关注在视觉问题上应答(VQA)目标在定位有关答案预测的右图像区域,提供强大的技术来促进多模态理解。然而,最近的研究指出,来自视觉关注的突出显示的图像区域通常与给定的问题和答案无关,导致模型混淆正确的视觉推理。为了解决这个问题,现有方法主要是为了对准人类关注的视觉注意力。尽管如此,收集这种人类数据是费力且昂贵的,使其在数据集中调整良好开发的模型。为了解决这个问题,在本文中,我们设计了一种新的视觉关注正规化方法,即attreg,以便在VQA中更好地视觉接地。具体而言,attraT首先识别了由骨干模型出乎意料地忽略(即,分配低注意重量)的问题所必需的图像区域。然后,利用掩模引导的学习方案来规范视觉注意力,以便更多地关注这些忽略的关键区域。所提出的方法是非常灵活的,模型不可知,可以集成到基于大多数基于视觉关注的VQA模型中,并且不需要人类注意监督。已经进行了三个基准数据集,即VQA-CP V2,VQA-CP V1和VQA V2的广泛实验,以评估attreg的有效性。作为副产品,将Attreg纳入强基线LMH时,我们的方法可以实现新的最先进的准确性为60.00%,在VQA-CP V2基准数据集上绝对性能增益为7.01%。 。
translated by 谷歌翻译
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $\epsilon_{\text{opt}}$ with $d^{1/3}\epsilon_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}\epsilon_{\text{opt}}^{-2/3} + \epsilon_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $\epsilon_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. We give an $(\epsilon_{\text{dp}}, \delta)$-differentially private algorithm which, given $n$ samples of Lipschitz loss functions, obtains near-optimal optimization error and makes $\min(n, n^2\epsilon_{\text{dp}}^2 d^{-1}) + \min(n^{4/3}\epsilon_{\text{dp}}^{1/3}, (nd)^{2/3}\epsilon_{\text{dp}}^{-1})$ queries to the gradients of these functions. In the regime $d \le n \epsilon_{\text{dp}}^{2}$, where privacy comes at no cost in terms of the optimal loss up to constants, our algorithm uses $n + (nd)^{2/3}\epsilon_{\text{dp}}^{-1}$ queries and improves recent advancements of [KLL21, AFKT21]. In the moderately low-dimensional setting $d \le \sqrt n \epsilon_{\text{dp}}^{3/2}$, our query complexity is near-linear.
translated by 谷歌翻译
New architecture GPUs like A100 are now equipped with multi-instance GPU (MIG) technology, which allows the GPU to be partitioned into multiple small, isolated instances. This technology provides more flexibility for users to support both deep learning training and inference workloads, but efficiently utilizing it can still be challenging. The vision of this paper is to provide a more comprehensive and practical benchmark study for MIG in order to eliminate the need for tedious manual benchmarking and tuning efforts. To achieve this vision, the paper presents MIGPerf, an open-source tool that streamlines the benchmark study for MIG. Using MIGPerf, the authors conduct a series of experiments, including deep learning training and inference characterization on MIG, GPU sharing characterization, and framework compatibility with MIG. The results of these experiments provide new insights and guidance for users to effectively employ MIG, and lay the foundation for further research on the orchestration of hybrid training and inference workloads on MIGs. The code and results are released on https://github.com/MLSysOps/MIGProfiler. This work is still in progress and more results will be published soon.
translated by 谷歌翻译
In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.
translated by 谷歌翻译
Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could support increased cashew production and poverty alleviation. By leveraging 2.4-m Planet Basemaps and 0.5-m aerial imagery, newly developed deep learning algorithms, and large-scale ground truth datasets, we successfully produced the first national map of cashew in Benin and characterized the expansion of cashew plantations between 2015 and 2021. In particular, we developed a SpatioTemporal Classification with Attention (STCA) model to map the distribution of cashew plantations, which can fully capture texture information from discriminative time steps during a growing season. We further developed a Clustering Augmented Self-supervised Temporal Classification (CASTC) model to distinguish high-density versus low-density cashew plantations by automatic feature extraction and optimized clustering. Results show that the STCA model has an overall accuracy of 80% and the CASTC model achieved an overall accuracy of 77.9%. We found that the cashew area in Benin has doubled from 2015 to 2021 with 60% of new plantation development coming from cropland or fallow land, while encroachment of cashew plantations into protected areas has increased by 70%. Only half of cashew plantations were high-density in 2021, suggesting high potential for intensification. Our study illustrates the power of combining high-resolution remote sensing imagery and state-of-the-art deep learning algorithms to better understand tree crops in the heterogeneous smallholder landscape.
translated by 谷歌翻译
In this paper, we develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner. Unlike most existing methods with offline feature generation, our method directly takes frames as input and further models motion evolution on two different temporal scales.Therefore, we solve the complexity problems of the two stages of modeling and the problem of insufficient temporal and spatial information of a single scale. Our proposed End-to-End MultiScale Network (E2EMSNet) is composed of two scales which are named segment scale and observed global scale. The segment scale leverages temporal difference over consecutive frames for finer motion patterns by supplying 2D convolutions. For observed global scale, a Long Short-Term Memory (LSTM) is incorporated to capture motion features of observed frames. Our model provides a simple and efficient modeling framework with a small computational cost. Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101. The extensive experiments demonstrate the effectiveness of our method for action prediction in videos.
translated by 谷歌翻译
Gaze estimation is the fundamental basis for many visual tasks. Yet, the high cost of acquiring gaze datasets with 3D annotations hinders the optimization and application of gaze estimation models. In this work, we propose a novel Head-Eye redirection parametric model based on Neural Radiance Field, which allows dense gaze data generation with view consistency and accurate gaze direction. Moreover, our head-eye redirection parametric model can decouple the face and eyes for separate neural rendering, so it can achieve the purpose of separately controlling the attributes of the face, identity, illumination, and eye gaze direction. Thus diverse 3D-aware gaze datasets could be obtained by manipulating the latent code belonging to different face attributions in an unsupervised manner. Extensive experiments on several benchmarks demonstrate the effectiveness of our method in domain generalization and domain adaptation for gaze estimation tasks.
translated by 谷歌翻译
As an important variant of entity alignment (EA), multi-modal entity alignment (MMEA) aims to discover identical entities across different knowledge graphs (KGs) with multiple modalities like images. However, current MMEA algorithms all adopt KG-level modality fusion strategies but ignore modality differences among individual entities, hurting the robustness to potential noise involved in modalities (e.g., unidentifiable images and relations). In this paper we present MEAformer, a multi-modal entity alignment transformer approach for meta modality hybrid, to dynamically predict the mutual correlation coefficients among modalities for instance-level feature fusion. A modal-aware hard entity replay strategy is also proposed for addressing vague entity details. Extensive experimental results show that our model not only achieves SOTA performance on multiple training scenarios including supervised, unsupervised, iterative, and low resource, but also has limited parameters, optimistic speed, and good interpretability. Our code will be available soon.
translated by 谷歌翻译